
refactor: give parquet CDC options an explicit enabled flag #22632

Open
kszucs wants to merge 2 commits into apache:main from kszucs:cdc-options-align-parquet-rs

@kszucs
Member

@kszucs kszucs commented May 29, 2026

Which issue does this PR close?

  • None

Rationale for this change

The CDC options currently work as use_content_defined_chunking: Option<CdcOptions> with a ConfigField impl that accepts a bare use_content_defined_chunking = true|false and otherwise enables CDC implicitly when any sub-field is set. This has a few problems:

  • Naming diverges from parquet-rs. WriterProperties exposes content_defined_chunking() / set_content_defined_chunking(Option<CdcOptions>) with no use_ prefix.
  • Implicit / order-dependent on the SQL side. Format options in COPY ... OPTIONS / CREATE EXTERNAL TABLE ... OPTIONS are applied from a HashMap (non-deterministic order). With the old bare-boolean form, mixing ... = false with a sub-field, or setting a sub-field after = false, could resolve to enabled or disabled depending on iteration order.
  • Extra machinery. Supporting the bare boolean required a hand-written impl ConfigField for CdcOptions + impl ConfigField for Option<CdcOptions> and a #[expect(clippy::should_implement_trait)] workaround, plus a zero-sentinel fallback in the proto mapping.
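
The order-dependence problem can be illustrated with a toy model. This is only a sketch, not the DataFusion implementation: the key names (`cdc`, `cdc.min_chunk_size`, `cdc.enabled`), the `OldCdc` / `NewCdc` types, and the `DEFAULT_MIN` placeholder are all hypothetical. It assumes the old form behaved as described above, where a bare boolean toggles the option and any sub-field enables it.

```rust
// Toy model of the two resolution schemes. Not DataFusion code: key names are
// shortened and the default value is a placeholder.
const DEFAULT_MIN: u64 = 256 * 1024;

/// Old scheme: `Some(min_chunk_size)` means CDC is on, `None` means off.
#[derive(Debug, Default, Clone, PartialEq)]
struct OldCdc(Option<u64>);

impl OldCdc {
    fn set(&mut self, key: &str, value: &str) {
        match key {
            // The bare boolean switches the whole option on or off.
            "cdc" => {
                self.0 = if value == "true" {
                    Some(self.0.unwrap_or(DEFAULT_MIN))
                } else {
                    None
                }
            }
            // Setting a sub-field implicitly enables CDC.
            "cdc.min_chunk_size" => self.0 = Some(value.parse().unwrap()),
            _ => panic!("unknown key {}", key),
        }
    }
}

/// New scheme: the flag and the parameter are independent fields.
#[derive(Debug, Clone, PartialEq)]
struct NewCdc {
    enabled: bool,
    min_chunk_size: u64,
}

impl NewCdc {
    fn new() -> Self {
        NewCdc { enabled: false, min_chunk_size: DEFAULT_MIN }
    }

    fn set(&mut self, key: &str, value: &str) {
        match key {
            "cdc.enabled" => self.enabled = value.parse().unwrap(),
            "cdc.min_chunk_size" => self.min_chunk_size = value.parse().unwrap(),
            _ => panic!("unknown key {}", key),
        }
    }
}

/// Apply the pairs in the given order, as a map iteration would.
fn resolve_old(pairs: &[(&str, &str)]) -> OldCdc {
    let mut cfg = OldCdc::default();
    for &(k, v) in pairs {
        cfg.set(k, v);
    }
    cfg
}

fn resolve_new(pairs: &[(&str, &str)]) -> NewCdc {
    let mut cfg = NewCdc::new();
    for &(k, v) in pairs {
        cfg.set(k, v);
    }
    cfg
}

fn main() {
    let size = ("cdc.min_chunk_size", "1024");

    // Old scheme: the last key applied wins, so the order changes the outcome.
    let off = ("cdc", "false");
    assert_eq!(resolve_old(&[off, size]), OldCdc(Some(1024))); // ends up enabled
    assert_eq!(resolve_old(&[size, off]), OldCdc(None)); // ends up disabled

    // New scheme: both orders resolve to the same config.
    let off = ("cdc.enabled", "false");
    let expected = NewCdc { enabled: false, min_chunk_size: 1024 };
    assert_eq!(resolve_new(&[off, size]), expected);
    assert_eq!(resolve_new(&[size, off]), expected);
}
```

Because each key in the second model writes to its own field, applying the same set of keys in any order gives the same result.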

Since CDC is unreleased, the config/proto surface can still be changed freely.

What changes are included in this PR?

  • Rename the ParquetOptions field use_content_defined_chunking -> content_defined_chunking (matches parquet-rs).
  • Make CdcOptions a plain config_namespace! with an explicit enabled: bool field alongside the chunking parameters; the field is a bare CdcOptions (no longer Option<CdcOptions>). CDC is on if content_defined_chunking.enabled is true. Setting a parameter no longer implicitly enables CDC, and the result is independent of key order.
  • Add CdcOptions::enabled() / CdcOptions::disabled() shorthand constructors.
  • Drop the hand-written ConfigField impls and the should_implement_trait workaround; the ConfigField plumbing is now generated by the macro.
  • Add an enabled field to the proto CdcOptions message so the proto <-> config mapping is a plain field copy in both directions (removes the presence-encoding and the zero-sentinel fallback).
  • Update unit tests, regenerate config docs + the information_schema snapshot, and add parquet_cdc_config.slt documenting the resolution behavior.
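
For readers who want a picture of the resulting shape, here is a minimal standalone sketch. It does not use `config_namespace!`; the field types and the default values are assumptions chosen for illustration, not taken from this PR.

```rust
/// Standalone stand-in for the config struct described above.
/// Field types and defaults are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct CdcOptions {
    pub enabled: bool,
    pub min_chunk_size: usize,
    pub max_chunk_size: usize,
    pub norm_level: i32,
}

impl Default for CdcOptions {
    fn default() -> Self {
        Self {
            enabled: false,
            min_chunk_size: 256 * 1024,  // placeholder default
            max_chunk_size: 1024 * 1024, // placeholder default
            norm_level: 0,               // placeholder default
        }
    }
}

impl CdcOptions {
    /// Shorthand: default parameters with CDC switched on.
    pub fn enabled() -> Self {
        Self { enabled: true, ..Default::default() }
    }

    /// Shorthand: default parameters with CDC switched off.
    pub fn disabled() -> Self {
        Self { enabled: false, ..Default::default() }
    }
}

fn main() {
    // Setting a parameter does not enable CDC.
    let mut opts = CdcOptions::default();
    opts.min_chunk_size = 64 * 1024;
    assert!(!opts.enabled);

    // Enabling is always explicit, and parameters can still be overridden.
    let on = CdcOptions { max_chunk_size: 4 * 1024 * 1024, ..CdcOptions::enabled() };
    assert!(on.enabled);
    assert_eq!(on.max_chunk_size, 4 * 1024 * 1024);

    assert_eq!(CdcOptions::disabled(), CdcOptions::default());
}
```

A field and an associated function can share the name `enabled` because Rust keeps fields and functions in separate namespaces.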

Are these changes tested?

Yes:

  • datafusion-common config + writer unit tests (enable toggle, parameter-does-not-enable, validation, writer round-trip).
  • datafusion-proto-common proto round-trip tests (enabled / disabled / negative norm level).
  • datafusion/core parquet integration tests (data round-trip, page boundaries).
  • sqllogictest: parquet_cdc.slt (end-to-end) and a new parquet_cdc_config.slt (config resolution / order independence).

Are there any user-facing changes?

Yes, but only to the unreleased CDC options:

  • Config key datafusion.execution.parquet.use_content_defined_chunking -> datafusion.execution.parquet.content_defined_chunking.enabled (plus .min_chunk_size / .max_chunk_size / .norm_level).
  • The bare-boolean form is removed; enable/disable via content_defined_chunking.enabled = true|false.

No released API is affected.

🤖 Generated with Claude Code

@github-actions github-actions Bot added the labels documentation (Improvements or additions to documentation), core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), common (Related to common crate), proto (Related to proto crate), datasource (Changes to the datasource crate) May 29, 2026
@kszucs kszucs force-pushed the cdc-options-align-parquet-rs branch from 3e66f6e to 30090c1 Compare May 29, 2026 23:37
@kszucs
Member Author

kszucs commented May 29, 2026

I had second thoughts about the CDC options configuration and actually found some weird implicit behavior (e.g. order-dependent configuration), so I switched to a more verbose but explicit one (I also plan to follow this approach in iceberg).

@alamb could we include it in the release so we don't need to break the API later on?

@github-actions

github-actions Bot commented May 29, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v53.1.0 (current)
       Built [  78.235s] (current)
     Parsing datafusion v53.1.0 (current)
      Parsed [   0.029s] (current)
    Building datafusion v53.1.0 (baseline)
       Built [  78.638s] (baseline)
     Parsing datafusion v53.1.0 (baseline)
      Parsed [   0.031s] (baseline)
    Checking datafusion v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.662s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 159.291s] datafusion
    Building datafusion-common v53.1.0 (current)
       Built [  26.699s] (current)
     Parsing datafusion-common v53.1.0 (current)
      Parsed [   0.049s] (current)
    Building datafusion-common v53.1.0 (baseline)
       Built [  26.983s] (baseline)
     Parsing datafusion-common v53.1.0 (baseline)
      Parsed [   0.051s] (baseline)
    Checking datafusion-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.712s] 222 checks: 219 pass, 3 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field ParquetOptions.content_defined_chunking in /home/runner/work/datafusion/datafusion/datafusion/common/src/config.rs:806

--- failure struct_missing: pub struct removed or renamed ---

Description:
A publicly-visible struct cannot be imported by its prior path. A `pub use` may have been removed, or the struct itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/struct_missing.ron

Failed in:
  struct datafusion_common::config::CdcOptions, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/0d7a1f79c8e4aedb5d8b530e212285dc0bc6757c/datafusion/common/src/config.rs:766

--- failure struct_pub_field_missing: pub struct's pub field removed or renamed ---

Description:
A publicly-visible struct has at least one public field that is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/struct_pub_field_missing.ron

Failed in:
  field use_content_defined_chunking of struct ParquetOptions, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/0d7a1f79c8e4aedb5d8b530e212285dc0bc6757c/datafusion/common/src/config.rs:890

     Summary semver requires new major version: 3 major and 0 minor checks failed
    Finished [  55.368s] datafusion-common
    Building datafusion-datasource-parquet v53.1.0 (current)
       Built [  37.367s] (current)
     Parsing datafusion-datasource-parquet v53.1.0 (current)
      Parsed [   0.024s] (current)
    Building datafusion-datasource-parquet v53.1.0 (baseline)
       Built [  38.133s] (baseline)
     Parsing datafusion-datasource-parquet v53.1.0 (baseline)
      Parsed [   0.025s] (baseline)
    Checking datafusion-datasource-parquet v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.142s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  76.619s] datafusion-datasource-parquet
    Building datafusion-proto v53.1.0 (current)
       Built [  48.872s] (current)
     Parsing datafusion-proto v53.1.0 (current)
      Parsed [   0.016s] (current)
    Building datafusion-proto v53.1.0 (baseline)
       Built [  48.755s] (baseline)
     Parsing datafusion-proto v53.1.0 (baseline)
      Parsed [   0.017s] (baseline)
    Checking datafusion-proto v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.274s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  99.270s] datafusion-proto
    Building datafusion-proto-common v53.1.0 (current)
       Built [  17.261s] (current)
     Parsing datafusion-proto-common v53.1.0 (current)
      Parsed [   0.040s] (current)
    Building datafusion-proto-common v53.1.0 (baseline)
       Built [  17.275s] (baseline)
     Parsing datafusion-proto-common v53.1.0 (baseline)
      Parsed [   0.042s] (baseline)
    Checking datafusion-proto-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.139s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure struct_missing: pub struct removed or renamed ---

Description:
A publicly-visible struct cannot be imported by its prior path. A `pub use` may have been removed, or the struct itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/struct_missing.ron

Failed in:
  struct datafusion_proto_common::generated::datafusion_proto_common::CdcOptions, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/0d7a1f79c8e4aedb5d8b530e212285dc0bc6757c/datafusion/proto-common/src/generated/prost.rs:978
  struct datafusion_proto_common::protobuf_common::CdcOptions, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/0d7a1f79c8e4aedb5d8b530e212285dc0bc6757c/datafusion/proto-common/src/generated/prost.rs:978
  struct datafusion_proto_common::CdcOptions, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/0d7a1f79c8e4aedb5d8b530e212285dc0bc6757c/datafusion/proto-common/src/generated/prost.rs:978

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  36.624s] datafusion-proto-common
    Building datafusion-proto-models v53.1.0 (current)
       Built [  19.638s] (current)
     Parsing datafusion-proto-models v53.1.0 (current)
      Parsed [   0.103s] (current)
    Building datafusion-proto-models v53.1.0 (baseline)
       Built [  20.201s] (baseline)
     Parsing datafusion-proto-models v53.1.0 (baseline)
      Parsed [   0.108s] (baseline)
    Checking datafusion-proto-models v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.669s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure struct_missing: pub struct removed or renamed ---

Description:
A publicly-visible struct cannot be imported by its prior path. A `pub use` may have been removed, or the struct itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/struct_missing.ron

Failed in:
  struct datafusion_proto_models::generated::datafusion::CdcOptions, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/0d7a1f79c8e4aedb5d8b530e212285dc0bc6757c/datafusion/proto-models/src/generated/datafusion_proto_common.rs:978
  struct datafusion_proto_models::protobuf::CdcOptions, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/0d7a1f79c8e4aedb5d8b530e212285dc0bc6757c/datafusion/proto-models/src/generated/datafusion_proto_common.rs:978

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  42.915s] datafusion-proto-models
    Building datafusion-sqllogictest v53.1.0 (current)
       Built [ 135.056s] (current)
     Parsing datafusion-sqllogictest v53.1.0 (current)
      Parsed [   0.019s] (current)
    Building datafusion-sqllogictest v53.1.0 (baseline)
       Built [ 134.470s] (baseline)
     Parsing datafusion-sqllogictest v53.1.0 (baseline)
      Parsed [   0.022s] (baseline)
    Checking datafusion-sqllogictest v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.092s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 272.200s] datafusion-sqllogictest

Contributor

@kosiew kosiew left a comment

@kszucs
This looks good to me overall. I left a couple of small cleanup suggestions, but nothing blocking.

Comment thread datafusion/common/src/file_options/parquet_writer.rs Outdated
norm_level: cdc.norm_level,
}
),
content_defined_chunking: Some(protobuf::CdcOptions {
Contributor

Nice work wiring this through. One small maintainability idea: the CdcOptions to proto field mapping now appears here and in proto/src/logical_plan/file_formats.rs. It might be worth adding a small helper conversion, such as impl From<&CdcOptions> for protobuf::CdcOptions if that fits the crate boundaries, so future CDC field changes only need one mapping update per direction.
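
A rough sketch of what such a helper could look like, assuming both types are visible from one crate. The `protobuf` module below is a hand-written stand-in for the generated prost type, and the field types on both sides are assumptions.

```rust
// Stand-in for the generated proto message (assumed field types).
mod protobuf {
    #[derive(Debug, Clone, PartialEq, Default)]
    pub struct CdcOptions {
        pub enabled: bool,
        pub min_chunk_size: u64,
        pub max_chunk_size: u64,
        pub norm_level: i32,
    }
}

// Stand-in for the config struct (assumed field types).
#[derive(Debug, Clone, PartialEq)]
pub struct CdcOptions {
    pub enabled: bool,
    pub min_chunk_size: usize,
    pub max_chunk_size: usize,
    pub norm_level: i32,
}

// Config -> proto: a plain field copy, kept in one place.
impl From<&CdcOptions> for protobuf::CdcOptions {
    fn from(c: &CdcOptions) -> Self {
        protobuf::CdcOptions {
            enabled: c.enabled,
            min_chunk_size: c.min_chunk_size as u64,
            max_chunk_size: c.max_chunk_size as u64,
            norm_level: c.norm_level,
        }
    }
}

// Proto -> config: the mirror image.
// Note: `as usize` truncates on 32-bit targets; a real conversion may
// prefer `usize::try_from` and return an error instead.
impl From<&protobuf::CdcOptions> for CdcOptions {
    fn from(p: &protobuf::CdcOptions) -> Self {
        CdcOptions {
            enabled: p.enabled,
            min_chunk_size: p.min_chunk_size as usize,
            max_chunk_size: p.max_chunk_size as usize,
            norm_level: p.norm_level,
        }
    }
}

fn main() {
    let cfg = CdcOptions {
        enabled: true,
        min_chunk_size: 4096,
        max_chunk_size: 65536,
        norm_level: -1,
    };
    let proto = protobuf::CdcOptions::from(&cfg);
    let back = CdcOptions::from(&proto);
    assert_eq!(cfg, back); // round-trips without loss on 64-bit targets
}
```

Whether this is possible in practice depends on the orphan rule: the `impl` has to live in a crate that owns either the config type or the proto type.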

Member Author

@kosiew apparently the generated parquet proto types in logical_plan/file_formats.rs (coming from proto-models) are different from the ones in proto-common. The protobuf serialization is a bit tangled; let me know if I'm missing something.

@kszucs kszucs force-pushed the cdc-options-align-parquet-rs branch from 5b2209e to ca4a2bf Compare June 1, 2026 16:21
Content-defined chunking (CDC) write options were added in apache#21110 and have
not been released yet (current workspace is 53.x; CDC is slated for 54.0.0),
so the config and proto surfaces can still be changed freely. This change
reworks them before they ship.

What changes:

* Rename the `ParquetOptions` field `use_content_defined_chunking` ->
  `content_defined_chunking`.
* `CdcOptions` becomes a plain `config_namespace!` with an explicit
  `enabled: bool` field alongside the chunking parameters, and the field is a
  bare `CdcOptions` (no longer `Option<CdcOptions>`). CDC is on iff
  `content_defined_chunking.enabled` is true. Add `CdcOptions::enabled()` /
  `CdcOptions::disabled()` shorthand constructors.
* Drop the bespoke `impl ConfigField for CdcOptions` /
  `impl ConfigField for Option<CdcOptions>` and the
  `#[expect(clippy::should_implement_trait)]` workaround that backed the old
  bare-boolean form. Everything is now generated by the macro.
* Add an `enabled` field to the proto `CdcOptions` message so the proto <->
  config mapping is a direct field copy, dropping the previous
  presence-encoding and the zero-sentinel fallback for the chunk sizes.

Why this is better:

* Naming matches parquet-rs. parquet's `WriterProperties` exposes
  `content_defined_chunking()` / `set_content_defined_chunking(...)` with no
  `use_` prefix; the field name now lines up across the boundary.

* Explicit, not magic. CDC is toggled with a real
  `content_defined_chunking.enabled = true|false` key instead of a special
  bare-boolean parse, and setting a chunking parameter no longer silently turns
  CDC on.

* No order-dependence on the SQL side. Format options in `COPY ... OPTIONS`
  and `CREATE EXTERNAL TABLE ... OPTIONS` are applied from a `HashMap`, i.e. in
  non-deterministic order. With a separate `enabled` flag, the flag and the
  parameters are set independently, so the resolved config never depends on the
  order in which the keys happen to be applied.

* Simpler. No hand-written `ConfigField` impls, no clippy hack, and the proto
  serialization is a plain field copy in both directions.

Tests, generated config docs, and the information_schema snapshot are updated
accordingly; a new `parquet_cdc_config.slt` documents the resolution behavior
(enable toggle, parameter-does-not-enable, order independence).
@kszucs kszucs force-pushed the cdc-options-align-parquet-rs branch from ca4a2bf to 66eef56 Compare June 1, 2026 16:22
Follow-up refinements to the parquet CDC options (all unreleased):

* Rename the config struct `CdcOptions` -> `ParquetCdcOptions` for explicitness
  and consistency with the other parquet sub-option structs
  (`ParquetColumnOptions`, `ParquetEncryptionOptions`), and to disambiguate from
  the unrelated "change data capture" meaning of CDC.
* Drop the chunk-size validation in the parquet writer path (parquet-rs enforces
  the bounds) and gate the writer call on the `enabled` flag.
* Add `From` conversions for the config <-> parquet-rs and config <-> proto
  `CdcOptions` mappings, replacing the inline field copies. parquet-rs has no
  `enabled` flag, so the conversion encodes enabled <-> Option presence.
* Rename the proto message to a top-level `ParquetCdcOptions` so config and proto
  names line up; field tags are unchanged, so the wire format is unaffected.

Regenerated prost/pbjson for proto-common and proto-models.
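
The "enabled <-> Option presence" encoding mentioned above can be sketched as follows. This is an illustration under assumptions: the `parquet_rs` module is a local stand-in for the parquet-rs type (no real crate is used), and the field types and default values are placeholders.

```rust
// Local stand-in for parquet-rs's CdcOptions, which has no `enabled` flag.
mod parquet_rs {
    #[derive(Debug, Clone, PartialEq)]
    pub struct CdcOptions {
        pub min_chunk_size: usize,
        pub max_chunk_size: usize,
        pub norm_level: i32,
    }
}

// Stand-in for the config struct (assumed field types and defaults).
#[derive(Debug, Clone, PartialEq)]
pub struct ParquetCdcOptions {
    pub enabled: bool,
    pub min_chunk_size: usize,
    pub max_chunk_size: usize,
    pub norm_level: i32,
}

impl Default for ParquetCdcOptions {
    fn default() -> Self {
        Self {
            enabled: false,
            min_chunk_size: 256 * 1024,  // placeholder default
            max_chunk_size: 1024 * 1024, // placeholder default
            norm_level: 0,               // placeholder default
        }
    }
}

// Config -> writer: disabled maps to `None`, enabled maps to `Some(params)`.
impl From<&ParquetCdcOptions> for Option<parquet_rs::CdcOptions> {
    fn from(c: &ParquetCdcOptions) -> Self {
        c.enabled.then(|| parquet_rs::CdcOptions {
            min_chunk_size: c.min_chunk_size,
            max_chunk_size: c.max_chunk_size,
            norm_level: c.norm_level,
        })
    }
}

// Writer -> config: presence becomes `enabled = true`; absence falls back
// to the defaults, so parameters of a disabled config are not preserved.
impl From<Option<parquet_rs::CdcOptions>> for ParquetCdcOptions {
    fn from(p: Option<parquet_rs::CdcOptions>) -> Self {
        match p {
            Some(p) => ParquetCdcOptions {
                enabled: true,
                min_chunk_size: p.min_chunk_size,
                max_chunk_size: p.max_chunk_size,
                norm_level: p.norm_level,
            },
            None => ParquetCdcOptions::default(),
        }
    }
}

fn main() {
    let off = ParquetCdcOptions { min_chunk_size: 4096, ..Default::default() };
    let writer_side: Option<parquet_rs::CdcOptions> = (&off).into();
    assert!(writer_side.is_none()); // a parameter alone does not enable CDC

    let on = ParquetCdcOptions { enabled: true, ..off };
    let writer_side: Option<parquet_rs::CdcOptions> = (&on).into();
    assert_eq!(writer_side.as_ref().map(|o| o.min_chunk_size), Some(4096));
    assert_eq!(ParquetCdcOptions::from(writer_side), on);
}
```

One consequence of this encoding, at least in this sketch, is that the round trip is lossy for a disabled config: custom chunk sizes are dropped when the value passes through `None`.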
@kszucs kszucs force-pushed the cdc-options-align-parquet-rs branch from 66eef56 to c7ccb83 Compare June 1, 2026 16:46
kszucs added a commit to kszucs/datafusion that referenced this pull request Jun 1, 2026
Backport of the follow-up to apache#22632, adapted to branch-54 (which predates the
proto-models split and the FromProto/TryFromProto refactor):

* Rename the config struct `CdcOptions` -> `ParquetCdcOptions` for explicitness
  and consistency with the other parquet sub-option structs
  (`ParquetColumnOptions`, `ParquetEncryptionOptions`), and to disambiguate from
  the unrelated "change data capture" meaning of CDC.
* Drop the chunk-size validation in the parquet writer path (parquet-rs enforces
  the bounds) and gate the writer call on the `enabled` flag, via `From`
  conversions between the config type and parquet-rs's `Option<CdcOptions>`.
* Add `From` conversions for the config <-> proto `CdcOptions` mapping in
  proto-common, replacing the inline field copies.
* Rename the proto message to `ParquetCdcOptions`; field tags are unchanged, so
  the wire format is unaffected.

Regenerated prost/pbjson for proto-common and proto.

Labels

auto detected api change (Auto detected API change), common (Related to common crate), core (Core DataFusion crate), datasource (Changes to the datasource crate), documentation (Improvements or additions to documentation), proto (Related to proto crate), sqllogictest (SQL Logic Tests (.slt))

2 participants